This notebook is based on the Kaggle competition dataset available at https://www.kaggle.com/competitions/store-sales-time-series-forecasting. The goal is to build a model that accurately predicts the unit sales of multiple product families at the different stores of a corporation.
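The competition page scores submissions with Root Mean Squared Logarithmic Error (RMSLE). As a reference point for the error values reported later, here is a minimal sketch of that metric (the `rmsle` helper is our own, not part of any library):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error: RMSE computed on log1p-transformed values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# A perfect forecast scores 0; errors on large-magnitude series are
# penalized far less harshly than under plain RMSE.
print(rmsle([10, 20, 30], [10, 20, 30]))  # → 0.0
```
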
# Download data
# Needs a kaggle.json file in C:\Users\<Username>\.kaggle
# import kaggle
# ! kaggle competitions download -c store-sales-time-series-forecasting -q
# ! unzip store-sales-time-series-forecasting
Archive: store-sales-time-series-forecasting.zip inflating: holidays_events.csv inflating: oil.csv inflating: sample_submission.csv inflating: stores.csv inflating: test.csv inflating: train.csv inflating: transactions.csv
!conda install -c plotly plotly-orca
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter('ignore', ConvergenceWarning)
import fbprophet
import pmdarima as pm
%matplotlib inline
First, let's import the train and test data and have a look.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head(10)
| | id | date | store_nbr | family | sales | onpromotion |
|---|---|---|---|---|---|---|
| 0 | 0 | 2013-01-01 | 1 | AUTOMOTIVE | 0.0 | 0 |
| 1 | 1 | 2013-01-01 | 1 | BABY CARE | 0.0 | 0 |
| 2 | 2 | 2013-01-01 | 1 | BEAUTY | 0.0 | 0 |
| 3 | 3 | 2013-01-01 | 1 | BEVERAGES | 0.0 | 0 |
| 4 | 4 | 2013-01-01 | 1 | BOOKS | 0.0 | 0 |
| 5 | 5 | 2013-01-01 | 1 | BREAD/BAKERY | 0.0 | 0 |
| 6 | 6 | 2013-01-01 | 1 | CELEBRATION | 0.0 | 0 |
| 7 | 7 | 2013-01-01 | 1 | CLEANING | 0.0 | 0 |
| 8 | 8 | 2013-01-01 | 1 | DAIRY | 0.0 | 0 |
| 9 | 9 | 2013-01-01 | 1 | DELI | 0.0 | 0 |
Here, we have the following features:
- id: a unique identifier for each row
- date: the date of the record
- store_nbr: the store at which the products were sold
- family: the type of product sold
- sales: total sales for a product family at a particular store on a given date (can be fractional, e.g. for items sold by weight)
- onpromotion: the number of items in a product family that were being promoted at a store on a given date
Let us check the data types, the missing values, and the unique values, and see whether any preprocessing needs to be done.
print('Column datatypes in train:')
print(train.dtypes,'\n')
print('Missing values in train:')
print(train.isnull().sum(),'\n')
print('Missing values in test:')
print(test.isnull().sum(),'\n')
print('Number of unique stores:',len(train['store_nbr'].unique()),'\n')
print('Unique product families:',train['family'].unique(),'\n')
print('Number of unique product families:',len(train['family'].unique()),'\n')
print('Data range of train:',train['date'].iloc[0],'to',train['date'].iloc[-1])
print('Data range of test:',test['date'].iloc[0],'to',test['date'].iloc[-1])
Column datatypes in train: id int64 date object store_nbr int64 family object sales float64 onpromotion int64 dtype: object Missing values in train: id 0 date 0 store_nbr 0 family 0 sales 0 onpromotion 0 dtype: int64 Missing values in test: id 0 date 0 store_nbr 0 family 0 onpromotion 0 dtype: int64 Number of unique stores: 54 Unique product families: ['AUTOMOTIVE' 'BABY CARE' 'BEAUTY' 'BEVERAGES' 'BOOKS' 'BREAD/BAKERY' 'CELEBRATION' 'CLEANING' 'DAIRY' 'DELI' 'EGGS' 'FROZEN FOODS' 'GROCERY I' 'GROCERY II' 'HARDWARE' 'HOME AND KITCHEN I' 'HOME AND KITCHEN II' 'HOME APPLIANCES' 'HOME CARE' 'LADIESWEAR' 'LAWN AND GARDEN' 'LINGERIE' 'LIQUOR,WINE,BEER' 'MAGAZINES' 'MEATS' 'PERSONAL CARE' 'PET SUPPLIES' 'PLAYERS AND ELECTRONICS' 'POULTRY' 'PREPARED FOODS' 'PRODUCE' 'SCHOOL AND OFFICE SUPPLIES' 'SEAFOOD'] Number of unique product families: 33 Data range of train: 2013-01-01 to 2017-08-15 Data range of test: 2017-08-16 to 2017-08-31
The following observations can be made:
- There are no missing values in either train or test.
- The date column is stored as a string (object) and needs to be parsed into a datetime type.
- There are 54 unique stores and 33 unique product families.
- The training data spans 2013-01-01 to 2017-08-15, and the test data covers the following 16 days, 2017-08-16 to 2017-08-31.
Let us start with the transformation of the date column. We also do not need the id column for our analysis, so we can remove it; the test ids will only be needed at submission time, since we have to submit a predicted sales value for each id.
for df in [train,test]:
# Convert date column to date object
df['date'] = pd.to_datetime(df['date'])
# Remove id column
df.drop('id', axis=1, inplace=True)
train.head(10)
| | date | store_nbr | family | sales | onpromotion |
|---|---|---|---|---|---|
| 0 | 2013-01-01 | 1 | AUTOMOTIVE | 0.0 | 0 |
| 1 | 2013-01-01 | 1 | BABY CARE | 0.0 | 0 |
| 2 | 2013-01-01 | 1 | BEAUTY | 0.0 | 0 |
| 3 | 2013-01-01 | 1 | BEVERAGES | 0.0 | 0 |
| 4 | 2013-01-01 | 1 | BOOKS | 0.0 | 0 |
| 5 | 2013-01-01 | 1 | BREAD/BAKERY | 0.0 | 0 |
| 6 | 2013-01-01 | 1 | CELEBRATION | 0.0 | 0 |
| 7 | 2013-01-01 | 1 | CLEANING | 0.0 | 0 |
| 8 | 2013-01-01 | 1 | DAIRY | 0.0 | 0 |
| 9 | 2013-01-01 | 1 | DELI | 0.0 | 0 |
train.dtypes
date datetime64[ns] store_nbr int64 family object sales float64 onpromotion int64 dtype: object
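Because the id column is dropped here, a copy of the raw test frame (which still carries ids) is what gets used at submission time. A hedged sketch of assembling the submission, using a toy stand-in for the test file and made-up `preds` values:

```python
import pandas as pd

# Toy stand-in for the raw test file, which retains the id column
raw_test = pd.DataFrame({'id': [3000888, 3000889, 3000890],
                         'date': ['2017-08-16'] * 3,
                         'store_nbr': [1, 1, 1],
                         'family': ['AUTOMOTIVE', 'BABY CARE', 'BEAUTY']})
preds = [4.2, 0.0, 2.7]  # hypothetical model output, aligned row-for-row with raw_test

# The submission keeps only the id, paired with the predicted sales
submission = raw_test[['id']].assign(sales=preds)
print(submission)
```
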
We have parsed the dates, and we have removed the id column. We can do some preliminary analysis of the data in train now.
Since we have data for each store, we can write a function that calculates the mean of a column over a given time frequency. We can also use interpolation to fill in values for days on which no data was recorded. We should not fill these with zero: days with no sales are explicitly recorded as 0 in the data, so a missing day means the value simply was not recorded.
def freq_resample(df, freq, col, key='date'):
df_resampled = df.resample(freq, on=key)[col].mean().reset_index()
df_resampled[col] = df_resampled[col].interpolate()
return df_resampled
train_monthly = freq_resample(train, freq='M', col='sales')
display(train_monthly.head(10))
| | date | sales |
|---|---|---|
| 0 | 2013-01-31 | 186.952405 |
| 1 | 2013-02-28 | 193.581846 |
| 2 | 2013-03-31 | 206.880581 |
| 3 | 2013-04-30 | 205.639071 |
| 4 | 2013-05-31 | 209.943594 |
| 5 | 2013-06-30 | 218.655893 |
| 6 | 2013-07-31 | 203.783364 |
| 7 | 2013-08-31 | 212.479434 |
| 8 | 2013-09-30 | 220.593588 |
| 9 | 2013-10-31 | 213.164266 |
# Sales analysis by store number
df = train.groupby('store_nbr')['sales'].mean().reset_index().sort_values(by='sales', ascending=False)
fig = px.bar(df, x='store_nbr', y='sales', title='Averaged Sales by Store Number')
fig.show()
# Sales analysis by product family
df = train.groupby('family')['sales'].mean().reset_index().sort_values(by='sales', ascending=True)
fig = px.bar(df, x='sales', y='family', orientation='h', title='Averaged Sales by Product Family')
fig.show()
We can make the following observations:
Next, we should look into the stores data.
stores = pd.read_csv('stores.csv')
stores.head()
| | store_nbr | city | state | type | cluster |
|---|---|---|---|---|---|
| 0 | 1 | Quito | Pichincha | D | 13 |
| 1 | 2 | Quito | Pichincha | D | 13 |
| 2 | 3 | Quito | Pichincha | D | 8 |
| 3 | 4 | Quito | Pichincha | D | 9 |
| 4 | 5 | Santo Domingo | Santo Domingo de los Tsachilas | D | 4 |
We have the following features:
- store_nbr: the store identifier
- city and state: the store's location
- type: the store type
- cluster: a grouping of similar stores
We can add this information to the train and test datasets by joining on the store numbers.
train = train.merge(stores, on='store_nbr', how='left')
test = test.merge(stores, on='store_nbr', how='left')
train.head(10)
| | date | store_nbr | family | sales | onpromotion | city | state | type | cluster |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | 1 | AUTOMOTIVE | 0.0 | 0 | Quito | Pichincha | D | 13 |
| 1 | 2013-01-01 | 1 | BABY CARE | 0.0 | 0 | Quito | Pichincha | D | 13 |
| 2 | 2013-01-01 | 1 | BEAUTY | 0.0 | 0 | Quito | Pichincha | D | 13 |
| 3 | 2013-01-01 | 1 | BEVERAGES | 0.0 | 0 | Quito | Pichincha | D | 13 |
| 4 | 2013-01-01 | 1 | BOOKS | 0.0 | 0 | Quito | Pichincha | D | 13 |
| 5 | 2013-01-01 | 1 | BREAD/BAKERY | 0.0 | 0 | Quito | Pichincha | D | 13 |
| 6 | 2013-01-01 | 1 | CELEBRATION | 0.0 | 0 | Quito | Pichincha | D | 13 |
| 7 | 2013-01-01 | 1 | CLEANING | 0.0 | 0 | Quito | Pichincha | D | 13 |
| 8 | 2013-01-01 | 1 | DAIRY | 0.0 | 0 | Quito | Pichincha | D | 13 |
| 9 | 2013-01-01 | 1 | DELI | 0.0 | 0 | Quito | Pichincha | D | 13 |
Next, we look at transactions data.
trans = pd.read_csv('transactions.csv')
trans.head(10)
| | date | store_nbr | transactions |
|---|---|---|---|
| 0 | 2013-01-01 | 25 | 770 |
| 1 | 2013-01-02 | 1 | 2111 |
| 2 | 2013-01-02 | 2 | 2358 |
| 3 | 2013-01-02 | 3 | 3487 |
| 4 | 2013-01-02 | 4 | 1922 |
| 5 | 2013-01-02 | 5 | 1903 |
| 6 | 2013-01-02 | 6 | 2143 |
| 7 | 2013-01-02 | 7 | 1874 |
| 8 | 2013-01-02 | 8 | 3250 |
| 9 | 2013-01-02 | 9 | 2940 |
The transactions data contains the number of transactions that occurred at a particular store on a particular date; each transaction can be thought of as an invoice generated that day. Transaction counts are integers, and they should be correlated with sales. We can parse the date column.
trans['date'] = pd.to_datetime(trans['date'])
trans.head(10)
| | date | store_nbr | transactions |
|---|---|---|---|
| 0 | 2013-01-01 | 25 | 770 |
| 1 | 2013-01-02 | 1 | 2111 |
| 2 | 2013-01-02 | 2 | 2358 |
| 3 | 2013-01-02 | 3 | 3487 |
| 4 | 2013-01-02 | 4 | 1922 |
| 5 | 2013-01-02 | 5 | 1903 |
| 6 | 2013-01-02 | 6 | 2143 |
| 7 | 2013-01-02 | 7 | 1874 |
| 8 | 2013-01-02 | 8 | 3250 |
| 9 | 2013-01-02 | 9 | 2940 |
We can look at the transactions through the dates for the different stores.
fig = px.line(trans.sort_values(["store_nbr", "date"]), x='date', y='transactions', color='store_nbr',\
title='Transactions')
fig.show()
Double-clicking on a store number in the legend isolates the transactions time series for that store. We can see some patterns in these plots.
To examine these trends more clearly, we can smooth the series by averaging transactions over all stores within each month.
df = freq_resample(trans, freq='M', col='transactions')
fig = px.line(df.sort_values('date'), x='date', y='transactions', title='Averaged transactions over all stores per month')
fig.show()
This plot helps us view the trends better. There are clear peaks during the months of December.
Next, we can see the shopping patterns based on day of the week.
df = trans.copy()
df["year"] = df.date.dt.year
df["dayofweek"] = df.date.dt.dayofweek+1
df = df.groupby(["year", "dayofweek"])['transactions'].mean().reset_index()
fig = px.line(df, x="dayofweek", y="transactions" , color = "year", title = "Averaged transactions over day of the week for all stores")
fig.show()
It's clear that the majority of the shopping occurs on weekends, while Thursday is generally a slow day.
Next, we look at the oil data.
oil = pd.read_csv('oil.csv')
oil.head()
| | date | dcoilwtico |
|---|---|---|
| 0 | 1/1/2013 | NaN |
| 1 | 1/2/2013 | 93.14 |
| 2 | 1/3/2013 | 92.97 |
| 3 | 1/4/2013 | 93.12 |
| 4 | 1/7/2013 | 93.20 |
The oil data contains the daily oil price. This is provided because Ecuador is an oil-dependent country, so its economic health is vulnerable to shocks in oil prices. Looking at the first row, there appear to be missing values in the data. Let us have a look.
oil['date'] = pd.to_datetime(oil['date'])
oil.isnull().sum()
date 0 dcoilwtico 43 dtype: int64
There are 43 missing values. Since this is daily data, we can use interpolation to fill them in. Interpolation cannot fill the first row, so we back-fill that value. Let us also rename the column to 'price'.
oil.rename(columns={'dcoilwtico':'price'}, inplace=True)
oil['price'] = oil['price'].interpolate()
print(oil.isnull().sum(),'\n')
oil['price'].fillna(method='bfill', inplace=True)
print(oil.isnull().sum(),'\n')
date 0 price 1 dtype: int64 date 0 price 0 dtype: int64
oil.head(10)
| | date | price |
|---|---|---|
| 0 | 2013-01-01 | 93.14 |
| 1 | 2013-01-02 | 93.14 |
| 2 | 2013-01-03 | 92.97 |
| 3 | 2013-01-04 | 93.12 |
| 4 | 2013-01-07 | 93.20 |
| 5 | 2013-01-08 | 93.21 |
| 6 | 2013-01-09 | 93.08 |
| 7 | 2013-01-10 | 93.81 |
| 8 | 2013-01-11 | 93.60 |
| 9 | 2013-01-14 | 94.27 |
We can now have a look at the price trends.
fig = px.line(oil, x='date', y='price', title='Daily oil price')
fig.show()
It is clear that the oil price dropped sharply, reaching a low by early 2016. We are told that Ecuador's economy is highly dependent on the oil price. We can check this claim by performing a correlation analysis between oil price and transactions or sales.
a = freq_resample(trans, freq='D', col='transactions')
b = freq_resample(train, freq='D', col='sales')
c = oil.rename(columns={'price':'oil_price'})
# The resampled date columns are already datetime, so we can merge directly
d = a.merge(b, on='date', how='inner')
d = c.merge(d, on='date', how='inner')
display(d.head())
| | date | oil_price | transactions | sales |
|---|---|---|---|---|
| 0 | 2013-01-01 | 93.14 | 770.000000 | 1.409438 |
| 1 | 2013-01-02 | 93.14 | 2026.413043 | 278.390807 |
| 2 | 2013-01-03 | 92.97 | 1706.608696 | 202.840197 |
| 3 | 2013-01-04 | 93.12 | 1706.391304 | 198.911154 |
| 4 | 2013-01-07 | 93.20 | 1643.413043 | 188.621100 |
d.isnull().sum()
date 0 oil_price 0 transactions 0 sales 0 dtype: int64
fig = make_subplots(rows=3, cols=1)
fig.add_trace(go.Scatter(x=d['date'], y=d['oil_price'], name='Daily oil price', mode='lines'),
row=1, col=1)
fig.add_trace(go.Scatter(x=d['date'], y=d['transactions'], name='Daily total transactions', mode='lines'),
row=2, col=1)
fig.add_trace(go.Scatter(x=d['date'], y=d['sales'], name='Daily total sales', mode='lines'),
row=3, col=1)
fig.update_layout(height=1000, width=1000)
fig.show()
While the correlation isn't visually obvious, we can draw some inferences from the peaks in the sales values: as the oil price declined, sales trended upward through 2016. We can calculate a correlation coefficient to test this hypothesis.
d.corr(method='spearman')
| | oil_price | transactions | sales |
|---|---|---|---|
| oil_price | 1.000000 | 0.274351 | -0.652512 |
| transactions | 0.274351 | 1.000000 | 0.134563 |
| sales | -0.652512 | 0.134563 | 1.000000 |
The Spearman correlation between oil price and transactions is about 0.27, while that between oil price and sales is about -0.65. Thus, oil price is indeed negatively correlated with store sales, but it has little relation to transactions. A plausible reading is that as the oil price falls, goods become cheaper, leading to higher sales. It is also worth noting that transactions and sales are not strongly correlated with each other either.
Before moving on to the holidays and events data, we can start analyzing the time series based on what we have. We need a stationary time series before we can build ARIMA-style forecasting models. A stationary time series has the following properties:
- its mean is constant over time,
- its variance is constant over time, and
- its autocovariance depends only on the lag between observations, not on the absolute time.
For now, we can treat the weekly resampled sales data as the initial time series and work on forecasting based on its properties.
df = freq_resample(train, freq='W', col='sales')
df.head()
| | date | sales |
|---|---|---|
| 0 | 2013-01-06 | 206.843478 |
| 1 | 2013-01-13 | 190.285220 |
| 2 | 2013-01-20 | 189.835452 |
| 3 | 2013-01-27 | 182.152050 |
| 4 | 2013-02-03 | 198.564267 |
Let us also make the date column the index, since that makes further time-series calculations easier.
df.set_index('date', inplace=True)
df.head()
| | sales |
|---|---|
| date | |
| 2013-01-06 | 206.843478 |
| 2013-01-13 | 190.285220 |
| 2013-01-20 | 189.835452 |
| 2013-01-27 | 182.152050 |
| 2013-02-03 | 198.564267 |
Let us write a function that prints the results of an augmented Dickey-Fuller (ADF) test, which tests the null hypothesis that a series is non-stationary.
def adf_test(timeseries):
#Perform Dickey-Fuller test:
print ('Results of Dickey-Fuller Test:')
dftest = adfuller(timeseries, autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
dfoutput['Critical Value (%s)'%key] = value
print (dfoutput)
fig = px.line(df, y='sales', title='Averaged Weekly Sales over all Stores')
fig.show()
Let us try ARIMA-family models (ARMA, ARIMA, SARIMA). We need to split the dataset into train and test sets; we can hold out the 2017 data for testing and use the rest for training.
df_train = df[df.index < pd.to_datetime('2017-01-01')]
df_test = df[df.index >= pd.to_datetime('2017-01-01')]
print(df_train.shape)
print(df_test.shape)
(208, 1) (33, 1)
y = df_train['sales']
model = ARIMA(y, order = (20, 1, 2))
model = model.fit()
y_pred = model.get_forecast(len(df_test.index))
y_pred_df = y_pred.conf_int(alpha = 0.05)
y_pred_df["Predictions"] = model.predict(start = y_pred_df.index[0], end = y_pred_df.index[-1])
y_pred_df.index = df_test.index
y_pred_out = y_pred_df["Predictions"]
fig = plt.figure()
fig.set_size_inches(15, 10)
plt.plot(df_train, color='blue', label='Train')
plt.plot(df_test, color='red', label='Test')
plt.plot(y_pred_out, color='green', label = 'Predictions')
plt.legend(fontsize=20)
fig.suptitle('Time Series Forecasting of Store Sales', fontsize=30)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Sales', fontsize=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
C:\Users\tejas\anaconda3\envs\coursera\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency W-SUN will be used.
C:\Users\tejas\anaconda3\envs\coursera\lib\site-packages\statsmodels\base\model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
In order to take this one step forward, we can look at a single store and a single product family. If we are able to set up our model to be robust enough for all stores and product families, we can perform predictions for the entire test set.
stores = train['store_nbr'].unique()
stores.sort()
print(stores,'\n')
families = train['family'].unique()
families.sort()
print(families)
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54] ['AUTOMOTIVE' 'BABY CARE' 'BEAUTY' 'BEVERAGES' 'BOOKS' 'BREAD/BAKERY' 'CELEBRATION' 'CLEANING' 'DAIRY' 'DELI' 'EGGS' 'FROZEN FOODS' 'GROCERY I' 'GROCERY II' 'HARDWARE' 'HOME AND KITCHEN I' 'HOME AND KITCHEN II' 'HOME APPLIANCES' 'HOME CARE' 'LADIESWEAR' 'LAWN AND GARDEN' 'LINGERIE' 'LIQUOR,WINE,BEER' 'MAGAZINES' 'MEATS' 'PERSONAL CARE' 'PET SUPPLIES' 'PLAYERS AND ELECTRONICS' 'POULTRY' 'PREPARED FOODS' 'PRODUCE' 'SCHOOL AND OFFICE SUPPLIES' 'SEAFOOD']
def sales_subset(store, family, plot_sales=False):
df = train.loc[(train['store_nbr']==store) & (train['family']==family),['date','sales']]
df.set_index('date', inplace=True)
df = df.asfreq('d')
df.fillna(0, inplace=True)
if plot_sales == True:
fig = px.line(df, y='sales', title=f'Sales from Store {store} within Product Family {family}')
fig.show()
return df
train.loc[(train['store_nbr']==stores[0]) & (train['family']==families[0]),:]
| | date | store_nbr | family | sales | onpromotion | city | state | type | cluster |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | 1 | AUTOMOTIVE | 0.0 | 0 | Quito | Pichincha | D | 13 |
| 1782 | 2013-01-02 | 1 | AUTOMOTIVE | 2.0 | 0 | Quito | Pichincha | D | 13 |
| 3564 | 2013-01-03 | 1 | AUTOMOTIVE | 3.0 | 0 | Quito | Pichincha | D | 13 |
| 5346 | 2013-01-04 | 1 | AUTOMOTIVE | 3.0 | 0 | Quito | Pichincha | D | 13 |
| 7128 | 2013-01-05 | 1 | AUTOMOTIVE | 5.0 | 0 | Quito | Pichincha | D | 13 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2991978 | 2017-08-11 | 1 | AUTOMOTIVE | 1.0 | 0 | Quito | Pichincha | D | 13 |
| 2993760 | 2017-08-12 | 1 | AUTOMOTIVE | 6.0 | 0 | Quito | Pichincha | D | 13 |
| 2995542 | 2017-08-13 | 1 | AUTOMOTIVE | 1.0 | 0 | Quito | Pichincha | D | 13 |
| 2997324 | 2017-08-14 | 1 | AUTOMOTIVE | 1.0 | 0 | Quito | Pichincha | D | 13 |
| 2999106 | 2017-08-15 | 1 | AUTOMOTIVE | 4.0 | 0 | Quito | Pichincha | D | 13 |
1684 rows × 9 columns
df = sales_subset(stores[0], families[0], plot_sales=False)
fig, ax = plt.subplots(2,1,figsize=(15,10))
plot_acf(df, lags=60, ax=ax[0], zero=False);
plot_pacf(df, lags=60, ax=ax[1], zero=False);
adf_test(df)
Results of Dickey-Fuller Test: Test Statistic -3.877970 p-value 0.002203 #Lags Used 24.000000 Number of Observations Used 1663.000000 Critical Value (1%) -3.434288 Critical Value (5%) -2.863280 Critical Value (10%) -2.567696 dtype: float64
The test statistic is below the critical values, and the p-value is less than 0.05, so we reject the null hypothesis of non-stationarity: according to the ADF test, our time series is stationary.
model = SARIMAX(df, order=(1,0,1), seasonal_order=(1,1,2,7), trend='c').fit()
result = model.predict(start=pd.to_datetime('2017-01-01'), dynamic=False)
result = pd.DataFrame(result)
result.rename(columns={'predicted_mean':'pred'}, inplace=True)
result['actual'] = df['2017':]
rmse = np.sqrt(mean_squared_error(result['actual'],result['pred']))
print('RMSE:',rmse)
RMSE: 2.7111257696954127
result = model.predict(start=pd.to_datetime('2017-08-16'), end=pd.to_datetime('2017-08-31'), dynamic=False)
result = pd.DataFrame(result)
result.rename(columns={'predicted_mean':'pred'}, inplace=True)
result
| | pred |
|---|---|
| 2017-08-16 | 4.717261 |
| 2017-08-17 | 4.516105 |
| 2017-08-18 | 5.486206 |
| 2017-08-19 | 5.044160 |
| 2017-08-20 | 2.764200 |
| 2017-08-21 | 4.746283 |
| 2017-08-22 | 4.829707 |
| 2017-08-23 | 5.003198 |
| 2017-08-24 | 4.595946 |
| 2017-08-25 | 4.944141 |
| 2017-08-26 | 5.278528 |
| 2017-08-27 | 2.609867 |
| 2017-08-28 | 4.484599 |
| 2017-08-29 | 5.166893 |
| 2017-08-30 | 4.918654 |
| 2017-08-31 | 4.715717 |
Thus, we see that we can make predictions for any store and product family. For now, we can fit a common SARIMAX specification to each series, keeping track of the validation RMSE and the predicted sales. After running the full loop, we can look at edge cases and diagnose their problems individually.
idx = pd.MultiIndex.from_product([stores, families], names=['store','family'])
rmse = pd.DataFrame({'error': np.zeros(len(idx))}, index=idx)
a = test.copy()
for store in stores:
    for family in families:
        df = train.loc[(train['store_nbr']==store) & (train['family']==family),['date','sales']]
        df = df[df['sales'] > 0]
        df.set_index('date', inplace=True)
        # Some store/family pairs have no nonzero sales at all; asfreq would
        # fail on the resulting empty index, so skip them
        if df.empty:
            continue
        df = df.asfreq('d')
        df.fillna(0, inplace=True)
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=UserWarning)
            model = SARIMAX(df, order=(1,0,1), seasonal_order=(3,1,3,7), trend='c').fit()
            # Calculate rmse on validation set
            result = pd.DataFrame(model.predict(start=pd.to_datetime('2017-01-01'), dynamic=False))
            result['actual'] = df['2017':]
            rmse.loc[(store, family), 'error'] = np.sqrt(mean_squared_error(result['actual'], result['predicted_mean']))
            # Calculate predictions on test dataset
            result = pd.DataFrame(model.predict(start=pd.to_datetime('2017-08-16'), end=pd.to_datetime('2017-08-31'), dynamic=False))
            a.loc[(a['store_nbr']==store) & (a['family']==family), 'sales'] = result['predicted_mean'].values
    print('Store Number:', store)
rmse.to_csv('rmse_train.csv')
a.to_csv('test_pred.csv')
a.sort_values(by='sales',ascending=True)
| | Unnamed: 0 | id | date | store_nbr | family | onpromotion | sales |
|---|---|---|---|---|---|---|---|
| 11404 | 11404 | 3012292 | 2017-08-22 | 29 | LADIESWEAR | 0 | 0.000000 |
| 20002 | 20002 | 3020890 | 2017-08-27 | 20 | BOOKS | 0 | 0.000000 |
| 23299 | 23299 | 3024187 | 2017-08-29 | 13 | BABY CARE | 0 | 0.000000 |
| 15805 | 15805 | 3016693 | 2017-08-24 | 51 | SCHOOL AND OFFICE SUPPLIES | 0 | 0.000000 |
| 5812 | 5812 | 3006700 | 2017-08-19 | 22 | BOOKS | 0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 20892 | 20892 | 3021780 | 2017-08-27 | 45 | BEVERAGES | 44 | 15076.754185 |
| 8460 | 8460 | 3009348 | 2017-08-20 | 46 | GROCERY I | 39 | 15110.022900 |
| 20934 | 20934 | 3021822 | 2017-08-27 | 46 | GROCERY I | 65 | 15386.823700 |
| 8493 | 8493 | 3009381 | 2017-08-20 | 47 | GROCERY I | 41 | 15525.989636 |
| 20967 | 20967 | 3021855 | 2017-08-27 | 47 | GROCERY I | 58 | 15701.724885 |
28512 rows × 7 columns
rmse.sort_values(by='error', ascending=False)
| | | error |
|---|---|---|
| store | family | |
| 46 | GROCERY I | 2669.809855 |
| 45 | GROCERY I | 2582.597331 |
| 9 | GROCERY I | 2482.416193 |
| 44 | BEVERAGES | 2455.476134 |
| 48 | GROCERY I | 2329.163602 |
| ... | ... | ... |
| 35 | LADIESWEAR | 0.000004 |
| 13 | BABY CARE | 0.000004 |
| 49 | BABY CARE | 0.000004 |
| 52 | BOOKS | 0.000004 |
| 13 | BOOKS | 0.000004 |
1782 rows × 1 columns
We see that some predictions have high RMSE values. We can now look, on a case-by-case basis, at the underlying reasons.
store = 46
family = 'GROCERY I'
df = train.loc[(train['store_nbr']==store) & (train['family']==family),['date','sales']]
df = df[df['sales'] > 0]
df.set_index('date', inplace=True)
df = df.asfreq('d')
df.fillna(0, inplace=True)
px.line(df, y='sales')
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=UserWarning)
    model = SARIMAX(df, order=(1,0,1), seasonal_order=(3,1,3,7), trend='c').fit()
    # Calculate rmse on validation set
    result = pd.DataFrame(model.predict(start=pd.to_datetime('2017-01-01'), dynamic=False))
    result['actual'] = df['2017':]
np.sqrt(mean_squared_error(result['actual'], result['predicted_mean']))
2538.391271358647
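Series like GROCERY I at store 46 have sales in the thousands, so their absolute errors are naturally large. One common remedy (a sketch of an alternative, not the method used above) is to stabilize the variance by fitting the model on log1p(sales) and inverting the forecasts with expm1:

```python
import numpy as np

sales = np.array([0.0, 3.0, 2500.0, 15386.8])  # toy values spanning the observed scale
log_sales = np.log1p(sales)      # compress the large-magnitude series before fitting
recovered = np.expm1(log_sales)  # invert model output back to the sales scale

# log1p/expm1 form an exact round trip (up to float precision),
# and log1p handles the many exact-zero sales days safely
print(np.allclose(recovered, sales))  # → True
```
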